Abstract

The process of determining the minimum pass level to separate the competent students from those who do not 
perform well enough is called standard setting. A large number of methods are widely used to set cut-scores for 
both written and clinical examinations. There are some challenging issues pertaining to any standard setting 
procedure. Ignoring these concerns would result in a large dispute regarding the credibility and defensibility of 
the method. The goal of this review is to provide a basic understanding of the key concepts and challenges in 
standard setting and to suggest some recommendations to overcome the challenging issues for educators and 
policymakers who are dealing with decision-making in this field. 
Introduction

Student assessment is an integral part of 
educational programs. Since it drives stu- 
dents' learning and highlights significant 
goals and objectives of the course, teachers
and administrators pay careful attention to
its different parts. However, standard setting
is an area of assessment that is addressed
less frequently.

A standard, also known as the minimum 
pass level, separates the competent students 
from those who are not. The process of de- 
termining this special score is called stand- 
ard setting (1). The decision to pass or fail 
an examinee is an important issue in medi- 
cal education, especially for licensure and 
credentialing purposes (2). The standard 
should not be set in an arbitrary way but it 
should be established through a specific 
methodology that considers the test's
objectives and content areas, the examinees'
performance, and the wider social or
educational setting (3).

A large number of methods have been
developed and used to set standards for both
written and clinical examinations (4).
Standard setting methods, depending on the 
purpose of the test, can be either norm- 
referenced or criterion-referenced. Norm- 
referenced (relative) standard setting meth- 
ods are used when a fixed proportion of 
examinees are required to pass. Since the 
standard is based on the ability of the co- 
hort of students, it is possible that some 
competent candidates would fail the exam. 
The criterion-referenced (absolute) meth- 
ods, such as Angoff or borderline regres- 
sion, deal with the desirable competency 
level that each student should achieve. So, 
hypothetically, all examinees may pass or 
fail a test with an absolute standard (2). 
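The contrast between the two families can be illustrated with a minimal Python sketch. It assumes the commonly described form of the Angoff calculation (each judge's item probabilities are summed, and the judges' totals are averaged) and a simple fixed-proportion rule for the relative standard. The function names, ratings, and scores are hypothetical and serve only as an illustration.

```python
from statistics import mean


def angoff_cut_score(ratings):
    """Absolute (criterion-referenced) cut score, Angoff-style.

    ratings[j][i] is judge j's estimate of the probability that a
    borderline candidate answers item i correctly (hypothetical data).
    Each judge's estimates are summed over items, then averaged over judges.
    """
    judge_totals = [sum(judge_ratings) for judge_ratings in ratings]
    return mean(judge_totals)


def norm_referenced_cut_score(scores, pass_fraction):
    """Relative (norm-referenced) cut score.

    Returns the score of the last examinee admitted when roughly
    pass_fraction of the cohort must pass. Ties at the cut score
    may let slightly more examinees through.
    """
    ranked = sorted(scores, reverse=True)
    n_pass = max(1, round(pass_fraction * len(ranked)))
    return ranked[n_pass - 1]


# Three hypothetical judges rating a four-item test.
ratings = [
    [0.6, 0.5, 0.8, 0.7],
    [0.5, 0.5, 0.9, 0.6],
    [0.7, 0.4, 0.8, 0.6],
]
strong_cohort = [4, 4, 4, 4, 4, 4, 4, 4]  # every examinee scores 4 out of 4

absolute_cut = angoff_cut_score(ratings)
relative_cut = norm_referenced_cut_score(strong_cohort, pass_fraction=0.5)

print("Absolute cut score:", round(absolute_cut, 2))
print("Passing under the absolute standard:",
      sum(score >= absolute_cut for score in strong_cohort))
print("Relative cut score:", relative_cut)
```

With the absolute standard the whole cohort can pass, because the cut score depends only on the judges' expectations. The relative cut score moves with the cohort, so its meaning in terms of competence changes from one group of examinees to the next.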

Each of the methods serves a particular 
purpose and none is agreed upon as the best
method or gold standard for all settings (5). 
Many published studies have addressed this 
topic by delineating practical steps of vari- 
ous standard setting procedures. Further- 
more, literature abounds with papers report- 
ing the application of different standard set- 
ting methods and comparing their results in 
terms of obtained cut score, pass rates and 
the degree of error in the process. Providing 
a detailed description of the existing tech- 
niques is beyond the scope of this manu- 
script and can be found elsewhere in the 
medical education literature (6-11).

The goal of this review is to provide a 
better understanding of standard setting for 
educators and policymakers who are dealing
with decision making in this field by
focusing specifically on the challenging 
issues surrounding this topic. We will also 
discuss some possible solutions and sug- 
gestions to overcome these problems, hence 
obtaining more credible results. 

Areas of concern 

While each standard setting procedure has
its own specifications, they all share some
challenging issues that may confront anyone
engaged in standard setting. Ignoring these
concerns during the procedure may result in 
a large dispute regarding the credibility and 
defensibility of the method (3,5,12). These 
challenging issues include, but are not lim- 
ited to, the following: the subjective nature 
of the standard setting, the definition of a 
minimally competent student, and the vari- 
ability in standard setting results. 

The subjective nature of the standard 
setting 

One of the very first challenges in setting 
standards is that all of the methods require 
the application of "judgment" (13,14). In 
some methods, experts are asked to esti- 
mate the probability that a borderline can- 
didate would correctly answer test items. 
Others require judges to observe and evaluate
students' performance during the examination.
In both procedures, the central and
important role of judgment cannot be ig- 
nored (4). Because standards are an expres- 
sion of subjective values, critics claim that 
they are not valid. It is important to consid- 
er, however, that no purely objective meth- 
od for determining the cut-score exists (13). 
In other words, although particular statisti- 
cal and mathematical methods are used as 
part of some standard setting approaches, 
there are no true cut-scores that can be 
achieved through application of a perfectly 
objective method. It should also be noted 
that human judgment plays a fundamental 
role in every level of student assessment 
and not merely in standard setting (14). 
Some of the tasks that depend on the judgment
of test developers include choosing the type
of item, establishing what questions to ask,
writing and editing questions, selecting the
best option in cued questions, and scoring
constructed-response questions. It seems that
the role of judgment in test development is 
accepted without difficulty while concerns 
about the subjective nature of standard set- 
ting are overemphasized. 

The definition of a borderline student 

Another important challenge in standard
setting is the definition of the "borderline"
student. Although application of this concept
is more pronounced in some methods such as
Angoff, in which judges should envisage a
borderline candidate and estimate their
performance, understanding the characteristics
of such a student is the cornerstone
of almost all methods. It is frequently
stated that the cognitive task of considering 
a borderline candidate is highly demanding 
even for the experts, to a degree which may 
impair their judgments. This is especially 
true if judges' concepts change from one 
item to another according to discussions or 
mental fatigue throughout the process (3). 
It has been noted that judges, in an effort to 
facilitate the creation of this conceptual im- 
age, think about an average student instead 
of focusing on the borderline performer, 
leading to the substitution of a criterion- 
based concept with a norm-referenced one
(13). 

This issue is closely related to the general 
decision on students' proficiency levels. 
The classification of students' performance
may be limited to competent or incompetent,
or may be divided into five or six categories,
each designating a certain level of
competency, with borderline performance
lying somewhere on the continuum (13-15).
While there is no universally-agreed rule 
for the number and definition of these lev- 
els, serious problems arise when judges try 
to explain the borderline category and justi- 
fy its location on the scale. 

Variability of cut-scores 

Another criticism aimed at the credibility 
of standard setting is the variability of ob- 
tained standards. As the literature reveals, 
variability in standard-setting results using 
different techniques, or even across replica- 
tions of the same procedure, can be large 
(7-11), adding weight to the argument that 
these methods cannot be trusted to distinguish
competent students from non-competent
candidates. Generally, when pass/fail
decisions are made in an examination, two
kinds of error may lead to misclassification
of students: the error associated with the
test score and the error related to the
determined standard (3). In fact, variability
in observed scores can occur in any
kind of repeated measurement and it is not 
limited to standard setting (14). It is not 
unusual for a student to take two so-called 
parallel exams and achieve two different 
scores. Nichols et al. argue that although
both standard setting and student assessment
lie in the field of measurement, they are not
exactly the same. The former should be
regarded as a stimulus-centered approach, in
which higher reliability is obtained if the
variance associated with items is large and
the variance associated with persons is
small. The latter is often treated as a
subject-centered approach, in which higher
reliability is obtained if the variance
associated with persons is large and the
variance associated with items is small (14).

Suggestions for improvement 

While the above-mentioned challenges 
are inherent to the procedure, several sug- 
gestions may reduce the concerns and en- 
hance the outcome. Some of these recom- 
mendations need to be followed before set- 
ting the standard and some should be ap- 
plied afterwards. Most of them can be 
adapted irrespective of the method selected 
for determining the cut score. 

Selection of appropriate judges 

The number and nature of the judges are 
central to the credibility of the standard. 
Judges have different cut scores in mind
because of differences in their educational
background, professional role, socioeconomic
status, as well as their knowledge, experience,
and opinions relating to the standard
setting method (5,12,13).

In the Angoff, Ebel, and Nedelsky methods,
which require a panel of specialists,
involving an appropriate number and mixture
of judges, so as to include a variety of
viewpoints and generate acceptable results,
is of paramount importance (12,13).

Although the exact number of panelists
required is still controversial, and studies
have yielded figures as low as 5 and as
high as 20, most suggestions point to
a group of around 10 judges as suitable for
this purpose (16-18). Furthermore, factors
such as the method of standard setting, the 
content area of the exam, and the presence 
(or absence) of group discussion or reality 
checks vary among these studies, limiting 
the generalizability of their findings. 

The judges should also be good repre- 
sentatives of the relevant experts and 
should be selected meticulously, consider- 
ing their age, gender, ethnicity, and educa- 
tional experience. 

Defining performance level and
characteristics of a borderline student

Before a method is selected, the stakeholders,
who may or may not be different
from the judges, should decide on students'
performance levels, including the number of
categories, their labels, and a behavioral
descriptor for each category (13-15). Since
most methods require judging the perfor- 
mance of a borderline student, development 
of criteria relating to minimally accepted 
competency is an important step. Detailed 
descriptors should demonstrate the 
knowledge, skills, and abilities in a specific 
context that are expected from a candidate 
in that category. 

Training of the judges 

Training the judges on the selected meth- 
od, including the opportunity for practice, 
discussion, and feedback, is critically
important. Bearing in mind the second
challenge (defining the borderline student),
it is essential to provide judges with the
performance level descriptors and then let
them reach a deep understanding through
discussion with other panelists (3,14).
Characterizing borderline students by
creating a list of the relevant skills
measured in the test can help judges to
reach a consensus (19).

Assessing the reliability of standard
setting

As mentioned earlier, variability in cut 
scores obtained by different standard set- 
ting methods or on different occasions is 
inevitable. A frequently used framework to 
interpret this variability is the reliability or 
consistency of the results. As reliability es- 
timates are used to acknowledge and delin- 
eate the magnitude of the error inherent in 
student assessment, a similar approach can 
be adapted to quantify the error component 
of the cut-score. In other words, the question
is how consistent the cut-score would be, or
what proportion of students would be
classified in the same way, if the procedure
were replicated, another method were used, or
another panel of judges were convened. The
more reliable a method, the less likely the
results are to be affected by large random
errors.
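As a rough illustration of the second question (what proportion of students would be classified similarly), the following Python sketch compares the pass/fail decisions produced by two cut-scores on the same set of scores. The function names, the scores, and the two cut-scores are hypothetical, and this simple agreement proportion is only one of several possible consistency indices.

```python
def classify(scores, cut_score):
    """Return True (pass) or False (fail) for each score at a given cut-score."""
    return [score >= cut_score for score in scores]


def classification_agreement(scores, cut_a, cut_b):
    """Proportion of examinees who receive the same pass/fail decision
    under two cut-scores (e.g. from two panels or two replications)."""
    if not scores:
        raise ValueError("scores must not be empty")
    decisions_a = classify(scores, cut_a)
    decisions_b = classify(scores, cut_b)
    same = sum(1 for a, b in zip(decisions_a, decisions_b) if a == b)
    return same / len(scores)


# Hypothetical exam scores and two hypothetical cut-scores.
scores = [48, 52, 55, 57, 60, 63, 66, 70, 74, 81]
agreement = classification_agreement(scores, cut_a=55, cut_b=60)
print(f"Same decision for {agreement:.0%} of examinees")
```

In this made-up example only the examinees scoring between the two cut-scores change status, which shows why variability in the standard matters most when many scores cluster near the pass mark.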

Reliability can be calculated using classical
test theory (CTT) or generalizability
theory (GT). Under CTT, an observed score
on a measurement is the sum of the true
score and the error component. Sources of
error in standard setting include different
panelists, different contexts, and different
occasions on which judgments occur
(14,20). In contrast to CTT, which considers
error to be unitary, GT can determine
the contribution of all sources of variance
at the same time. The intent of a G-study in
this context is to differentiate among items
while generalizing results over judges. But
caution must be exercised in interpreting
the reliability coefficient, since it might be
influenced by one judge who dominates
others or endorses a shared misconception
among panelists (5,6). In such cases, a high
reliability coefficient no longer reflects
judges' true perceptions or expectations.
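The two frameworks can be written compactly as follows. The notation is introduced here for illustration and is not taken from the cited sources. The GT expression assumes the simplest case of a fully crossed items-by-judges design with items as the object of measurement; real G-studies may include further facets such as occasions.

```latex
% Classical test theory: observed score = true score + error,
% with T and E assumed uncorrelated.
X = T + E, \qquad
\sigma^2_X = \sigma^2_T + \sigma^2_E, \qquad
\rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}

% Generalizability theory, items (i) crossed with judges (j):
% the rating variance is split into separate components.
\sigma^2(X_{ij}) = \sigma^2_i + \sigma^2_j + \sigma^2_{ij,e}

% Generalizability coefficient for differentiating among items
% when ratings are averaged over n_j judges.
E\rho^2 = \frac{\sigma^2_i}{\sigma^2_i + \sigma^2_{ij,e} / n_j}
```

Under these assumptions, the coefficient rises when the item variance is large relative to the residual term or when more judges are used, which matches the stimulus-centered view of standard setting described earlier.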

It should, however, be noted that reliabil- 
ity does not tell us about the meaningful- 
ness of the standard and does not guarantee 
its appropriateness for the given purpose. 
This issue will be dealt with in greater de- 
tail in the forthcoming paragraphs. 

Ensuring the validity of standard setting 

Standard setting aims to divide
candidates into mastery and non-mastery
categories, and the validity of standard
setting, also known as its credibility, deals
with how well this task has been accomplished.
A procedure that misclassifies a
non-competent student as competent (false
positive) or vice versa lacks accuracy.

One method to measure the validity of 
standard setting is to follow the students' 
performance in the future. If the competent
students show acceptable behavior in their 
workplaces, the standard will prove credi- 
ble. However, in this design, it is impossi- 
ble to compare the performance of compe- 
tent and non-competent students because 
the latter are usually not permitted to pur- 
sue practice. Another method is to compare 
pass/fail rates of one test with those of other
concurrent exams. 

It is important to keep in mind that the 
above-mentioned approaches do not prove
the validity of the standard itself (i.e. 45 or
55 or ...) since the 'true' cut-score does not
exist. We might at best try to ensure that 
the chosen method is appropriate and can 
give rise to sound decisions. The appropri- 
ateness of the method is also supported 
when evidence of a defensible process is
demonstrated. It is evident that setting 
standards by gathering the judgments of a 
group of experts in an unbiased way, and 
with consideration of the level of the exam- 
inees and the content of the exam, makes 
more sense than relying only on a fixed 
pre-defined arbitrary score. For this reason, 
careful documentation of the whole pro- 
cess, including number and characteristics 
of experts, as well as collecting comments 
of judges and stakeholders about credibility 
of the results, should be considered. However,
it should be noted that although an
appropriately set standard may make the
pass/fail decisions defensible, there is no
conclusive way to ensure the validity of any
standard-setting method, and relying only on
procedural evidence provides weak
justification for the credibility of the decisions.

Conclusion 

Standard setting in the medical profession 
is still in an evolutionary stage. While vari- 
ous approaches have been developed, there 
are still many concerns regarding this pro- 
cess. Although these challenges cannot be 
fully eliminated, ensuring the quality of the 
standard setting, which can be accom- 
plished by taking some of the steps men- 
tioned in this manuscript, is of paramount 
importance. The information obtained 
through this quality assurance may be help- 
ful in interpreting the standards and can 
also prove that, in spite of the variability of 
scores, the pass/fail decisions are defensible 
and reasonable. 